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Abstract 

We sequenced to near completion the entire mtDNA of 28 Sardinian goats, selected to represent the widest possible 
diversity of the most widespread mitochondrial evolutionary lineage, haplogroup (Hg) A. These specimens were reporters of 
the diversity in the island but also elsewhere, as inferred from their affiliation to each of 1 1 clades defined by D-loop 
variation. Two reference sequences completed the dataset. Overall, 206 variations were found in the full set of 30 
sequences, of which 23 were protein-coding non-synonymous single nucleotide substitutions. Many polymorphic sites 
within Hg A were informative for the reconstruction of its internal phylogeny. Bayesian and network clustering revealed a 
general similarity over the entire molecule of sequences previously assigned to the same D-loop clade, indicating 
evolutionarily meaningful lineages. Two major sister groupings emerged within Hg A, which parallel distinct geographical 
distributions of D-loop clades in extant stocks. The pattern of variation in protein-coding genes revealed an overwhelming 
role of purifying selection, with the quota of surviving variants approaching neutrality. However, a simple model of 
relaxation of selection for the bulk of variants here reported should be rejected. Non-synonymous diversity of Hg's A, B and 
C denoted that a proportion of variants not greater than that allowed in the wild was given the opportunity to spread into 
domesticated stocks. Our results also confirmed that a remarkable proportion of pre-existing Hg A diversity became 
incorporated into domestic stocks. Our results confirm clade Al 1 as a well differentiated and ancient lineage peculiar of 
Sardinia. 
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introduction 

The domestication of the goat [Capra hircus) has been one of the 
major landmarks in the evolution of human modes of subsistence 
[1,2,3]. Besides its valuable dairy products, its portability has made 
the goat the key element to ensure a stable human setdement in 
otherwise inhospitable habitats. 

The genetic traces of domestication in this species have been 
first addressed [4] by analyzing the diversity of the mitochondrial 
D-loop among individuals collected worldwide. These authors 
showed that the matrUines of the current breeds coalesce to three 
founders of the so-called haplogroups (Hg's) A, B and C, indicative 
of a limited number but not a singular drawing of female founders 
from wild populations. Later, the number of haplogroups grew to 
6 (A, B, C, D, F, G [5]), with the newly described Hg F as the most 
basal one. Each of these found close matches in the alleged 
ancestor of the domestic goat, the bezoar (C. aegagrus) [6] . 



While Hg A is the most derived lineage, it is also the most 
widespread and displays a fast and markedly star-shaped radiation, 
to be interpreted as the result of the success of animals bred in 
maternal isolation from wild population(s). Given the presence of 
this Hg in many breeds currently grown worldwide, its phyloge- 
ography could potentially be the essential piece of information to 
reconstruct the spread of maternal lineages. In particular it would 
help in distinguishing between ancient and recent displacements of 
stocks. This seems particularly useful as a further characterization 
of some locales, in addition to the assignment of a reasonable 
window of time for their introduction in a given area. This 
additional information can also be an added value whenever such 
locales are particularly adapted to a specific habitat, are valued as 
a recreational resource, or when the labeling of the products as 
"genuinely local" may enhance their economic relevance, thus 
calling for specific protection and conservation actions 
[7,8,9,10,11]. 
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In this context, however, the internal phylogeny of haplogroup 
A is poorly understood, mainly because of the hypermutability of 
the D-loop of mtDNA, which causes many positions to display the 
same nucleotide by the repeated occurrence of the same 
mutational event (identity by state) rather than by descent. The 
net result is a set of inextricable reticulations when one attempts to 
reconstruct internal, evolutionarily meaningful lineages. As an 
intermediate step forward, we [12] took advantage of the 
demography underlying a large proportion of the population of 
the Italian island of Sardinia. By sampling a large number of 
animals we identified, within Hg A, subgroups of D-loop 
sequences related in a star-like fashion, reminiscent of common 
descent from a limited number of founders. We then defined 1 1 
"ad interim" clades of Hg A, and listed a reduced number of 
mutations, among those occurring in the D-loop, that can 
distinguish them. Their frequencies ranged 1 .8-2 1 % among > 
1,500 Hg A individuals. By comparing our results with those of 
other screenings [5,13] we also realized that, with the exception of 
two clades {A-9 and A-11), the same groupings accounted for a 
large proportion of the stocks grown elsewhere. While this result 
can be explained by a relatively recent (historical) import of 
matrilines (possibly conveyed by different breeds) in Sardinia, the 
presence of pecuhar lineages might be due to a much earher 
introduction. These lineages might have been rare even in their 
homeland at the time of introduction in Sardinia and now lost 
outside the Island, or they might be the result of new mutations 
that occurred and were selected in loco. 

In summary, Sardinia recapitulates the D-loop phylogeny of a 
non trivial portion of Hg A and thus offers the opportunity of 
working out a clearer picture of divergence within the entire 
haplogroup. We then selected animals carrier of each of the 1 1 Hg 
A clades, plus an Hg C outgroup, and sequenced their mtDNA to 
near completion. In this way we could test our initial hypothesis 
that D-loop clades are predictors of evolutionarily meaningful 
lineages. This question is of general interest as far as the D-loop is 
the DNA region most commonly addressed to initially approach a 
number of issues of species' biology (see e.g. Chapter 21 in ref 
[14]). Also, finding a large number of variants along the entire 
mtDNA sequence can identify major evolutionary forces affecting 
livestock populations after domestication. This includes a better 
understanding of the selective regime(s) that impacted the 
mitochondrial genome. We then asked: i) whether our set of 
variants fitted a model of relaxed selection allowing for otherwise 
mildly deleterious alleles, as hypothesized for other species [15,16], 
and ii) whether a reduced number of mutational changes or 
assemblages of mutational changes was subjected to positive 
selection, possibly due to their potential physiological relevance. 

In the context of the resulting phylogenetic reconstruction, the 
clade A-11 of Hg A emerges as an ancient hneage, currendy 
peculiar of Sardinia, characterized by many non-synonymous 
substitutions. 

Results 

The sequence coverage attained in the 28 individuals selected 
for complete mtDNA resequencing (Table SI in File SI) is 
summarized in Figure SI, with colors indicating Hg A clade 
affiliation. With the exception of a single clade A8 individual (id. 
304) sequence coverage was always above 80%. For all clades at 
least one individual was covered at >89%. For the single Hg C 
individual only a 336 bp portion of the sequence was missing (98% 
coverage). 

Only two regions remained poorly covered in several individ- 
uals, i.e. those at the beginning of amphcon 3 and at the end of 



amplicon 31. In the first case the presence of a Cio stretch often 
caused slippage that generated overlapping shifted profiles further 
downstream (this problem was not observed in carriers of an 
interrupting T at pos. 1114). In the second case the complete 
coverage of the large amplicon 31 was attempted by sequencing in 
both directions, but profil(;s obtained with tlu' re\c'rse primer 
turned out to be often short or of poor quality and then discarded. 

As one of our aims was to explore the phylogenetic reliability of 
the relationships between Hg A clades defined by D-loop 
sequences, our entire analysis took into account only variations 
in the region spanning reference positions 1—15,430 (i.e. outside 
the D-loop). In addition to our Hg C individual (id. 985), all our 
analyses included sequences GU068049 (representative of Hg A 
from Mongolia) and GU295658 (representative of Hg B from 
South East Asia), thus reaching a total of 30 sequences hereafter 
referred to as the "full set". 

Overall, 206 variations were found in the full set of 30 sequences 
(Table S2 in File SI). Of these, 113 were contributed by die Hg C 
and B sequences, whereas 93 were found among Hg A sequences. 
This confirmed Hg's C and B as strongly divergent lineages, but 
the number of residual variable positions denoted that many 
polymorphic sites within Hg A could be informative for the 
reconstruction of its internal pliylogcn)'. In fact, 59 of tlu; 93 
variations were singletons in Hg A individuals and the remaining 
34 were shared by two or more individuals. Eleven variable 
positions were private for the reference sequence GU068049, 7 of 
which residing in regions covered in ref [17] and also reported 
therein. 

Among single nucleotide substitutions 193 and 8 were 
transitions and transversions, respectively. Among transitions T/ 
C changes outnumbered A/G changes (146 vs. 47). As to indels, 
two consecutive thymine residues reported for the reference 
sequence GU068049, were not found in any other sequence and 
were considered as two additional adjacent private single 
nucleotide indels (positions 177, 178). Two variations consisted 
in length changes in low complexity As (pos. 11560-7) and Cio 
(pos. 1 1 1 1-20) stretches. In particular this latter segment was On 
in Hg B but shortened and affected by other substitutions in Hg C. 
Finally, an A insertion was found in a single individual, between 
positions 1238 and 1239. 

Reconstructing Hg A phylogeny 

We used two independent methods to obtain a reconstruction of 
affinities between mtDNA sequences. The first method was a 
Bayesian tree reconstruction. The results are displayed in Figure 
S2. The relative order of branching of Hg's C and B replicates that 
of refs. [4,5,17] with highly supported nodes, and identifies Hg C 
as the most basal hneage. The next node (support 100%) defines 
the radiation of all Hg A lineages and separates the reference 
sequence GU068049 from the rest of Hg A sequences. A series of 
closely spaced nodes then separates small groupings. Each of these 
is formed by sequences clustered together with high support and 
previously assigned to the same D-loop clade. The only exceptions 
are represented by an A5 sequence (id. 1495) clustering with two 
A7 sequences, another A5 (id. 331) defining a single branch 
outside the clade of the remaining A5 sequences, and the two A8 
sequences (id's 830 and 304) which did not cluster together. Of the 
three sequences not previously assigned to any of the D-loop 
clades, two (id's 1332 and 480) were highly similar in the D-loop 
and clustered together in the tree, while the third (id. 692) was 
divergent in the D-loop and clustered with A9 in the tree. 

In summar)', this clustering scheme denoted a general similarity 
over the entire molecule of sequences previously assigned to the 
same D-loop clade. As to the relationships among clades, these 
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remained less strongly supported and pardy contradicted our 
previous reconstruction (see Fig. 2 in [12]), most likely because of 
the star-like and rapid divergence between them, and the presence 
of nucleotide states generated by recurrent mutational events (see 
below). The 27 Sardinian Hg A sequences here examined 
clustered into two sister groupings downstream to a poorly (0.21) 
supported node. However we noticed two features of this partition 
that may strengthen its significance. First, it separated the 
sequences in the same manner as a major hiatus in the D-loop 
network (clades Al to A6 plus A9 in Fig. S2, bottom vs. A7 to Al 1 
in Fig. S2, top) and second, it had a clear geographic correlate (Fig. 
S2). In particular, the sequences of the first group turned out to be 
associated with D-loop clades represented mainly in Europe when 
inspecting three datasets [5,13,18], with only a minority of 
observations from Asia. Conversely, the sequences of the second 
group appeared to be associated with three clades (A7, A8 and 
A 10) abundant throughout Asia. Sequences corresponding to D- 
loop clade Al 1 were assigned to this latter group (Fig. S2, top). 
Indeed a single instance of the haplotypic arrangement of D-loop 
variants characteristic of clade Al 1 was represented among the 
2488 individuals in the datasets of refs. [5,13], and it was sampled 
in Italy. Conversely, its common presence in Sardinia was 
confirmed in an independent study [18]. 

We also constructed a network representing the arrangement of 
allele states among our sequences (Fig. 1). Since we excluded some 
positions, the topology of the network turned out to be simpler 
than the Bayesian tree. Nevertheless, the branching order of Hg's 
C, B and A was replicated (Fig. 1, inset), with a large multi- 
furcation of all Hg A sequences. The network also replicated the 
clustering of sequences according to their D-loop clade affihation 
already evident in the Bayesian tree, with a single exception 
represented by id. 304 (clade A8) which now appeared at the tip of 
a long branch distal to an A5 sequence (Fig. 1, bottom right). 
Interestingly, one short branch of the multifurcation led to a node 
basal to all sequences of clades A8, A7, AlO and Al 1 (plus A5, id 
1495), i.e. one of the major groupings in the Bayesian tree. The 
network construction identified 7 positions with recurrent muta- 
tional events (Table S2 in File SI). 

Exploring possible selection regimes 

We searched for genetic traces left by selection regimes in the 
whole set of lineages here considered by analyzing separately 
protein-coding and RNA-coding genes. 

As to protein-coding genes, we argued that if variants in our set 
were largely neutral, they should be found in all genes and in 
numbers proportional to the targetable positions in each gene. 
This expectation was verified, with a linear and highly correlated 
increase of variants as a function of gene length, both including 
(R^ = 0.71; p = 3xlO"'*) and excluding (R^ = 0.52; p = 5xl0"^) 
the large number of variants contributed by the Hg B and C 
sequences (Fig. 2A). This observation was in large part attributable 
to variants in third codon positions, which accumulated linearly 
(Fig. 2B) with gene length in both the full set of 30 sequences 
(R^ = 0.81; p = 3xl0"^) as weU as in the 28 Hg A sequences 
(R^ = 0.51; p = 6x10 % In both cases co;c/// displayed an unusual 
accumulation of variants. 

The linearity between the number of variants and gene length 
was somewhat disrupted for non-synonymous variants (Fig. 2C), 
with low correlations for both the full set of 30 sequences 
(R^ = 0.06, p = 0.41) and tiie 28 Hg A sequences (R^ = 0.08, 
p = 0.33). The highest absolute numbers of non-synonymous 
variants were in coxIII and atp6 (total length = 280 and 226 
codons, respectively) whereas the relatively long cox/ (5 14 codons) 
harboured no variants. 



In order to distinguish whether this result was due to stochastic 
variations in the small numbers of non-synonymous variants or 
differential selection across genes, we analyzed the data in a 
McDonald- Kreitman framework [19]. For each gene we then 
compared non-synonymous and synonymous variations found 
among Hg's A, B and C (polymorphic) or only in Hg A, with those 
recorded in the divergence with C. ibex. The results are shown in 
Table S3 in File SI. For none of the genes the McDonald- 
Kreitman test turned out to be significant, either considering the 
three major haplogroups or only Hg A. 

Thus, in summary, tlu' absolute number of non-synonymous as 
compared to synonymous variants was in line with a generalized 
purifying selective regime on the former ones, possibly with 
variable intensities across genes, as exemplified by the accumula- 
tion above average in coxIII (21.7% of non-synonymous variants). 
However, when compared with the wild outgroup, the proportions 
of the two types of variants showed that, quantitatively, the 
intensity of purification remained unaltered under domestication. 

The above results could not rule out that only a small subset of 
non-synonymous variants were impacted by the breeding practic- 
es. In fact, under domestication, selection on these variants may 
have become less stringent or null (see e.g. [15]), possibly as an 
effect of breeding practices that favored the survival of females and 
drove the demography of matrilines [20] . A second hypothesis is 
that diversifying selection operated by breeders maintained a 
repertoire of valuable phenotypes influenced by some of the non 
synonymous variants. 

We analyzed the occurrence of non-synonymous variants in the 
phylogeny, to rule out possible cumulative effects of \ ariants with 
subtie phenotypic effects. Of the 23 non-synonymous substitutions 
observed in the full set of 30 sequences (Table 1), 3 were peculiar 
of Hg C (Fig. 1, inset). For one of these Hg C carried the derived 
aminoacid state (atpd 62S), whereas for the remaining two Hg C 
carried the ancestral state [coxIII 168T and nd3 29A) and the 
derived alleles were shared by Hg's B and A (coxIII 168M and nd3 
29V). The excess of 7 non-synonymous vs. 20 synonymous 
changes on the branch leading to Hg B (matching the increased (O 
ratio reported in this Hg reported in [17]), was not significant 
(p = 0.06; Fisher exact test) as compared to the rest of variants in 
protein-coding genes. 

One substitution (atp6 43I>V) was shared by all Hg A 
sequences but not Hg's B and C and was then assigned to the 
branch basal to Hg A, with the derived state in this Hg in 
agreement with previous results [17]. 

Whithin Hg A, a non-synonymous substitution was found on 
each of the branches basal to sequences affiliated with D-loop 
clades Al, A2, A3, A7, AlO and All, while two were on the 
branch leading to the highly similar uncharacterized sequences id. 
1332 and 480. This co-occurrence of aminoacid substitutions in 
atp6 and represents a non-significant excess (p = .38, Fisher 

exact test) as compared to 4 and 5 suljstitutions in ('ach gene, 
respectively, over the 37 branches of the n('twork. Additional 
private non-synonymous substitutions were found for sequences 
affiliated with clades A8 (id. 304), A3, A5 and All. Finally, an 
additional non-synonymous substitution (nd5 586L>Q) was found 
in one of the 5 supplementary clade Al 1 individuals sequenced in 
amplicon 28. In summary', a non-synonymous substitution was 
peculiar of all Hg A sequences and additional 12 affected 18 of the 
28 Hg A sequences here considered, ruling out the selection of 
functionally highly diversified lineages. 

We also reasoned that the two selective regimes outiined above 
lead to different expectations as to the composition of the 
repertoire of non-synonymous variants, with an enrichment of 
mildly deleterious variants under relaxed selection vs. an absence 
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Figure 1. Median joining networit of 30 sequences sliowing tKie occurrence of 23 non-synonymous substitutions. Each circle 
represents a sequence. Branch length Is proportional to the number of mutations. Note that two A1 1 and two uncharacterlzed sequences are lumped 
Into single (larger) nodes (see Materials and Methods). The haplogroup or D-loop clade (for Hg A) affiliation of clusters of sequences Is shown In large 
empty characters. Aminoacid replacements are shown beside each branch In their polarized form. The Hg A portion of the network Is magnified while 
the original network Is In the Inset. A large lozenge Indicates the likely position of the root and was added In the drawing to partition three variants 
peculiar of Hg C. Their polarization was obtained by comparison with C. ibex. 
doi:l 0.1 371 /journal.pone.0095969.g001 



of detrimental variants under diversifying selection. In order to 
gain insights on the features of our array of non-synonymous 
variants, we explored the homologous positions in the largest 
collection of mtDNA mutations recorded so far for a single species, 
i.e. the MITOMAP database of Homo sapiens. For each DNA 
position this database reports the observed variant(s), the 
corresponding aminoacid change, the number of occurrences of 
each variant in GenBank (GB, to be used as a rough estimate of its 
degree of polymorphism) and the phenotypic consequences of 
each mutation, if available. The results are reported in columns 5- 
7 of Table 1. In nearly half of cases (12/23) one of the aminoacids 
observed among our variants was conserved in the orthologous 
position of the human reference sequence (rCRS). Four of our 
aminoacid substitutions were replicated exactly in humans. Three 
of these had 10, 9 and 6 records in GB, respectively, i.e. values 
above the 60th percentile of frequencies among the 7932 variants. 
Interestingly the fourth {nd4 T109A) was initially associated with 
the MELAS syndrome. However it was also found at high 
frequency in normal controls [21]. Its causative role in the disease 
is now questioned (http://omim.org/entry/516003). Among our 



animals, this variation was found in only one of the two A3 
individuals, and its frequency in the Sardinian stock at large 
remains to be determined. Overall, in the set of 13 protein-coding 
genes, the ratio of non-synonymous to synonymous variants 
recorded in MITOMAP approached that observed here only 
when rare variants were excluded (considering only variations 
found >40 times in GenBank, Fig. S3). 

In 8 of the 23 human orthologous positions (34.7%) only 
synonymous variants were recorded, a finding potentially denoting 
that aminoacid replacements at these positions are not tolerated. 
Indeed this observation does not deviate from expectations, as far 
as the proportions of positions displaying only synonymous 
variations across the human orthologues of the 10 genes listed in 
Table 1 range between 21.9% in atp6 and 63.7% in nd4. 

Finally, we considered the indicators of possible pathogenicity 
computed in ref [22] for all possible aminoacid substitutions in the 
human rCRS. The 10 substitutions that would replicate those 
observed in goats all display MutPred pathogenicity scores below 
the mode of the overall distribution and selection scores lower than 
1, indicative of effects much milder than mutations recognized to 
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Figure 2. Scatterplot of number of variants recorded as a function of length of eacli of tlie 13 protein-coding mtDNA genes. Filled 
dots: variants scored among all 30 sequences considered. Empty dots: variants scored among the 28 Hg A sequences. Panel A: all positions. Panel B: 
only 3'^'' codon positions. Panel C: only non-synonymous variants (note that the three black dots at 0 variants overlap and obscure empty dots), coxlll 
is indicated by an arrow. 
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be pathogenic. Even the nd4 T109A, the sole substitution which 
appears in the list of 75 pathogenic mutations, has scores as low as 
0.482 and 0.443 (Table 1). 

When feasible, we repeated the same type of comparative 
analyses for variants in RNA-codrng genes (Table 2). For variants 
in rRNA genes, 8 high-confidence orthologous positions displayed 
high frequency variants in human, too. The fact that for 5 
positions there are no variants recorded in MITOMAP is within 
expectations, as far as this database contains only 271 and 438 
variable positions for the 945 and 1569 bp of the 12S and 16S 
rRNA, respectively. 

Also the orthologues of variants residing in tRNA genes 
appeared to be highly polymorphic in human, with the replication 
of the alternative alleles in 7 out of 8 cases. 



Discussion 

In this work we proceeded a step forward in the description of 
the phyletic relationships between lineages belonging to a major 
component of the goat mitochondrial diversity, Hg A. We 
obtained this by leveraging the diversity represented in a single 
Italian region, where the goat is highly valued and its diversity 
preserved. 

We restricted our analysis to a portion of the molecule less 
prone to recurrent mutations than the highly variable D-loop (115 
variable positions in 1213 bp across our 30 sequences). Neverthe- 
less we showed that clusters of sequences defined by a selected 
subset of D-loop mutations [12] do match clusters defined by 
variants in the rest of the molecule. The topology connecting these 
clusters needs additional support, which can be obtained by 
completion of the typing of missing positions in our set. In 



Table 2. Comparative analysis of the substitutions in RNA-coding genes inferred in the full set of 30 sequences. 





Position 


Gene 


Allelic variants 


Homologous pos. in rCRS 


Variants in IVIITOIVIAP 


Occurrences in GB 


165 


1 TC rDM A 


T/C 


743C 


not present 




1 77 


1 TC 1-DM A 


T/del 


amb. align. ^ 






178 


1 "»C -DM A 

1 ib-rKNA 


T/del 


amb. align. ^ 






1 80 


1 -DM A 

1 2i-rKNA 


G/T 


amb. align. ^ 






181 


1 TC 1-DM A 


A/T 


amb. align. ^ 






1 96 


1 zb-inlNA 


T/C 


773T 


T-C 


1 


205 


1 TC 1-DM A 
1 Zb-fnlNA 


T/C 


782A 


not present 




213 


1 TC 1-DM A 


T/C 


790A 


not present 




605 


1 TC 1-DM A 


T/C 


1 1 87T 


T-C 


18 


944 


1 TC 1-DM A 


T/C 


1 520T 


T-C 


13 


1 026 


1 TC 1-DM A 


T/C 


1 601 C 


C-T 


2 


1114 


1 £C -DM A 


T/C 


amb. align. ^ 






1111-20 


16S-rRNA 


C,o/C„ 








1192 


16S-rRNA 


T/C 


1764C 


C-T 


1 


1 238/39 


16S-rRNA 


A/del 


amb. align. ^ 






1369 


16S-rRNA 


T/C 


amb. align. ^ 






1671 


1 6S-rRNA 


A/G 


2241 A 


not present 




1850 


16S-rRNA 


T/C 


241 7C 


C-G 


4 


2120 


1 6S-rRNA 


T/C 


2686G 


not present 




2190 


16S-rRNA 


A/G 


2756C 


C-A, C-T 


1, 5 


2208 


1 6S-rRNA 


T/C 


2774C 


not present 




2619 


16S-rRNA 


T/C 


3184C 


C-T 


1 


3771 


tRNA-GIn 


T/C 


4335C 


C-T 


9 


3892 


tRNA-Met 


T/C 


4454T 


T-A, T-C 


90, 80 


6894 


tRNA-Ser (UCN) 


T/C 


7468C 


C-T 


5 


7723 


tRNA-Lys 


A/G 


8315A 


A-G 


1 


9827 


tRNA-Arg 


T/C 


10421C 


C-T 


1 


9860 


tRNA-Arg 


T/C 


10454T 


T-C 


59 


9863 


tRNA-Arg 


A/G 


amb. align. ^ 






1 1 560-7 


tRNA-His 


Ag/Ag 


amb. align. ^ 






11569 


tRNA-His 


A/G 


12160A 


not present 




11624 


tRNA-Ser (AGY) 


T/C 


12214C 


not present 




11691 


tRNA-Leu(CUN) 


A/G 


12279A 


A-T 


1 


11702 


tRNA-Leu{CUN) 


A/G 


1 2290A 


A-G 


2 



^Ambiguous alignment. 

doi:1 0.1 371/journal.pone.0095969.t002 



PLOS ONE I www.plosone.org 



7 



April 2014 I Volume 9 | Issue 4 | e95969 



Goat mtDNA Haplogroup A 



particular, dubious assignments involve the lineages associated 
with the D-loop clades A5 and A8, which probably stiU contain a 
large amount of internal diversity. As to A9 we [12] already 

warned that this has to be considered mainly a paragroup, as it is 
defined only by ancestral alleles in the D-loop. 

Also, the inclusion of high quality complete Hg A mitogenomes 
from outside Sardinia in future analyses is imperative. In fact, only 
three (7214T/C, 9221T/C and 14518A/G) of the variable 
positions here reported were also present among the 62 positions 
variable in Hg A animals sampled mainly in eastern and south- 
eastern Asia [17], and allele states at two of these wer(; probably 
homoplasic. Interestingly, also in this latter study the peculiarity of 
the reference sequence GU068049 (from the Inner Mongolian 
breed) emerged, with four private variants and three variants 
shared with a single individual (sample ID G1307). These findings 
qualify the trees of the two studies as complementary subsections 
of the entire Hg A phylogeny, but also predict that northern Asia 
may represent a further and poorly explored reservoir of goat 
mitochondrial diversity. Only when a robust topolog)^ will be 
attained it will be possible to reconsider the diversity of the D-loop 
and to enumerate the recurrent events within it. 

With the above caveats in mind, a major branching among the 
Hg A sequences here reported separates lineages currendy found 
mosdy in western Eurasia from lineages common in the eastern 
Old World. Phylogeographic considerations in this species are 
weakened by its high portability by both land and sea since several 
millennia [23], which may have led to the seeding of widely distant 
regions of the world in a manner disconnected from phyletic 
affinities (see e.g. [24]). Despite this, the present study captured a 
partition probably reflecting an ancient split in the spread of stocks 
carrying Hg A from its likely homeland in Ea.stern Anatolia [6]. 
This hypothesis may serve to orient the search of markers in 
archeological material that documents that process. As far as 
genetics is concerned, D-loop clade assignment was feasible in at 
least 89% of 29 Hg A bezoars [6] and identified 17 and 8 carriers 
of clades A8 and A5, respectively (Table 1), corresponding to each 
of the major Hg A tree branches. While the former animals were 
largely sampled in Iran (14 vs. 3 in Turkey) the latter ones were 
from Iran and Turkey in equal proportions, replicating to some 
extent the same east-west imbalance in the distribution of domestic 
goat lineages. 

Our results also confirm the view that a remarkable proportion 
of pre-existing Hg A diversity became incorporated into domestic 
stocks. A date between 201 and 281 thousand years ago was 
proposed [4] for the coalescent of Hg's C, B and A. In our 
Bayesian tree Hg A occupies approximately 1 5 % of the total tree 
height, corresponding to a date far predating the beginning of 
domestication. Using the average rate proposed [25] for S"^"* codon 
positions in wild Caprinae results in a similar estimate (FUe SI). 
This discrepancy between the age of the molecules and that of the 
domestic stocks testifies of a process of domestication not 
accompanied by strong botflenecks and loss of matrUinear 
diversity, extending previous results obtained in the goat [6,17]. 
Speculations on which mechanism(s) produced this outcome 
include multiple drawings also for Hg A, from populations less 
differentiated that those harboring Hg's C and B. This could have 
occurred during a long period preceding fuU-blown domestication, 
in which semi-wild populations were simply managed [3,6,26,27], 
without interrupting gene flow with wild relatives. It is worth 
noting that this represents one of the extremes of a full spectrum of 
possibiUties. At the opposite end lies the taurine cattie {Bos taums) 
for which an extremely narrow bottieneck was inferred based on 
mtDNA data [10]. 



Overall, the pattern of variation in protein-coding genes 
revealed and conflrmed an overwhelming role of purifying 
selection [15,17], with the quota of surviving variants approaching 
neutrality. The features of the set of 23 non-synonymous variants 
did not provide any evidence for an enrichment in deleterious 
aminoacid replacements which have approached neutrality due to 
relaxed selection. Conversely, they rephcated a group of variants 
which have been found to reach polymorphic frequencies in 
humans [22]. Interestingly, a large proportion of non-synonymous 
substitutions were on branches with two or more sequences 
downstream (internal branches), an additional evidence of their 
neutral or quasi-neutral behaviour. The overaU proportions of 
synonymous variants here reported for Hg's A, B, C or Hg A only 
(0.865 and 0.838, respectively; Table S3 in File Si, bottom) closely 
approach the asymptotic values reported for a number of 
mammalian taxa in ref [28]. These estimate the constant 
corresponding to the observed proportion of synonymous muta- 
tions after the effect of purifying selection has become negUgible. 
In our dataset the applicability of formal tests on the drasticity of 
aminaocid replacements (e.g. as implemented in TreeSAAP, [29]) 
was limited by the reduced number of variants in each gene. 

A similar argument applies to RNA-coding genes, as weU. Here 
we found a number of variants per kilobase (1 1.5, 6.9 and 8.0 for 
12S rRNA, 16S rRNA and tRNA's, respectively) even lower than 
in protein-coding genes (15.0), in line with the strong purifying 
selection inferred in other species [15]. However, the few observed 
variants have polymorphic homologous variants in humans. 

In summary, a simple model of relaxation of selection for the 
bulk of variants here reported should be rejected. Rather, our 
analysis of non-synonymous diversity (which considered intra- vs. 
inter-species instead of intra- vs. inter-haplogroup [17] compar- 
isons) revealed that a proportion of variants not greater that that 
aUowed in the wild was given the opportunity to spread into 
domesticated stocks. This also excludes a regime of generaUzed 
diversifying selection. Mitochondrial phenotypes have been 
claimed to play a role in a number of life-history traits [30,31]. 
A reduction of metabolic efficienty was hypothesized to be 
favourable in a carnivore, the dog [16], but this seems hardly 
transferable to a ruminant, whose valued products are milk, fat 
and lean body mass and a high reproductive rate. Of main interest 
here are instead temperature and starvation resistance, in light of 
the new environmental conditions encountered by the goats along 
with their breeders' migrations. An approach to reveal possible 
positive selection on individual variants wiU be to measure 
diversity indexes in severed sequences from the same clade. This 
has not been feasible here, due to the highly biased criterion for 
inclusion of individuals in the study. 

Our results now confirmed that clade All, (previously 
recognized as peculiar of Sardinia [12,18] where it reaches 
frequencies as high as 8.6%), indeed identifies a lineage that is well 
differentiated along the entire mtDNA molecule and is not simply 
a derivative of one of the lineages confined to Europe. Based on 
the current distribution of mtDNA lineages mosdy related to it, the 
possible geographical range for the origins of clade Al 1 sequences 
is much wider than Europe alone. Moreover, based on the number 
of mutations distinguishing it, its antiquity is comparable to that of 
much more common lineages currendy spread from Europe to 
East Asia. This suggests that goats carriers of this lineage were 
relocated into the Island from populations that are now lost or not 
yet explored for mtDNA diversity. 
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Materials and Methods 

The individuals 

All activities carried out by the Regional Association of 
Sardinian Farmers (ARAS) officers, part of which related to this 
study, foUowed ethical guidelines for care and use of agricultural 
animals for research (EC Directive 86/609/EEC). For this study 
specific approval by a review board was not necessary, as none of 
the procedures used here met the criteria to define them 
experiments as defined in Artitie 2 of the cited Directi\(:. All 
animals of the prcsc-nt study belong to the series pre\'iously studied 
[12]. They were not re-sampled and the new data were generated 
on previously collected DNA, without any additional burden to 
the animal. 

Twenty-eight individuals were selected to represent all the 
clades reported in [12], based on D-loop diversity. In order to 
represent also the within-clade diversity, individuals belonging to 
the same clade were included only if they differed at some 
positions in the D-loop. Three individuals were included 
specifically to represent the unclassified D-loop sequences. 

Additional criteria for inclusion in the above group of 28 
individuals were: a general phenot\pe fitting that of the "Sarda" 
breed [32], and a geographical origin preferentially in the 
Ogliastra sub-region of Sardinia (see Fig. 1 in [12]). 

In order to confirm/dismiss a subset of mutations, additional 38 
individuals belonging to D-loop clades A2, A4, A5, A7, A8 and 
All were considered and sequenced for amplicons 10, 22 and 28 
(see below). This group, too, was selected to display the "Sarda" 
phenotype, but the geographical criterion was relaxed and 
representatives of other sub-regions were included. The essential 
data are reported in Table SI in File SI. 

Amplification and sequencing 

The entire mtDNA was subdivided into 33 partially overlapping 
amplicons (Fig. SI and Table S4 in File SI). Amplifications were 
performed in a total volume of 15 ul in the presence of Ix reaction 
buffer, l.,5 mM MgC12, 7.5 pmoles of each primer, 200 uM 
dNTPs, 1 ng of genomic DNA and 0.5 U Taq polymerase. 
Thermal profiles consisted of 5 min. at 96°C, 35 cycles of 30 sec. 
at 95°C, 30 sec. at the annealing temperatures reported in Table 
S4 in File Sland 1 min. at 72°C, followed by a final extension of 
10 min. at 72°C. Five ul of the PGR product were then incubated 
15 min. at 37°G with exosap, foUowed by 15 min. at 80' C. 
Sequencing reactions were performed with the ABI PRISM 
BigDye Terminator kit (vl.l. Applied Biosystems) in the presence 
of 1.7 pmoles of primer F (and R for amplicon 31) for 10 min. at 
96°C, 25 cycles of 5 sec. at 55°C and 2 min. at 60°C. Products 
were separated in a PRISM ABI 3130xL Genetic Analyzer. Low 
quality sequence profiles were discarded upon visual inspection. 
High quality profiles were aligned with the Lasergene SeqMan 
software together with sequences representing haplogroups A to D 
(accession nos. GU068049, GU229278, GU295658, GU229279, 
GU229280, GU229281). 

Throughout the text position numbering refers to sequence 
GU068049. All sequences are deposited in GenBank with Acc. n. 
KJ192209-KJ192236. 

Two individuals (id. 1237 and 1300) displayed ambiguous peaks 
at pos. 4587 (Fig. S4). As an unambiguous T was observed in 6 
additional individuals it was considered a real variant. However 
the T/C ambiguity (Y) was retained in all analyses involving id. 
1237 and 1300. 



Data analysis 

A Bayesian tree reconstruction was performed with BEAST 
[33]. We used all the 28 nearly complete sequences between 
positions 1 and 15,430, i.e. outside the D-loop, retaining all 
uncertainties present in the original data. In this analysis we 
included the reference sequences GU068049 and GU295658, 
obtained from an Hg A and an Hg B individual, respectively, but 
excluded the other sequences reported in ref [34] to be 
unexpectedly divergent or chimeric (final whole set of 30 
sequences). A number of preliminary runs were used to set the 
priors, a suitable starting tree and to explore the possibility of 
estimating absolute dates for the tree nodes (see File SI). In the 
final analysis we used a substitution rate of 1 to obtain branch 
lengths in mutational units. We then used a GTR+F+i substitution 
model with strict clock and constant population size for a 10- 
million step run, sampled every 10,000 steps. After discarding a 
50% bum-in, the results were inspected with Tracer, condensed 
with TreeAnnotator and visualized with TreeFIG as recom- 
mended in the user manual. 

In order to obtain an independent interpretation of the 
phylogentic relationships between sequences and to precisely 
locate mutations on branches, we constructed a median joining 
network [35]. In order to avoid a large number of unsupported 
reticulations, in this analysis we removed all positions not tvped in 
9 or more of the 28 individuals, plus 6 problematic positions (177, 
178, 605, 4606, 9600, 11567) in which the inferred number of 
state changes did not fit that immediately deduced upon visual 
inspection (due to the additional presence of missing data in some 
of the sequences). The statistics fJe produced by the program was 
used to identily positions undergoing multiple mutational changes 
along the network. The assignment of mutations to each branch is 
reported in Table S2 in File SI, with branches numbered as in Fig. 
S5. 

The assignment of D-loop sequences from external datasets to 
the same clades here used was based on the presence of haplotypic 
arrangements of variants described in Table 2 of [12]. For each of 
the cited datasets, the full set of sequences was aligned with a non- 
redundant subset of those in ref [12], and allelic states compared 
after renumbering. 

Synonymous and non-synonymous substitutions were identified 
by translating the DNA serjuences in MEGA [36]. The counts of 
non-synonymous and synonymous variants used in the McDon- 
ald-Kreitman tests [19] were obtained by aligning the DNA 
sequences of the 1 3 protein-coding genes in GU06B049 with the 
orthologues of Capra ibex isolate G1253 (Acc. n. AB743814.1 to 
AB743826.1). From the raw counts we excluded all positions for 
which the difference between G. ibex and C. hircus GU068049 was 
attributable to the presence of a derived polymorphic allele in the 
latter, thus leaving only variants fixed before the divergence of 
Hg's C, B and A. 

The complete list of mtDNA variants recorded at MITOMAP 
(http://www.mitomap.org/MITOMAP) as of Aug. 30, 2013 was 
downloaded and analyzed in a spreadsheet. This contains data for 
7932 DNA variants obtained from 18363 GenBank (GB) 
sequences with size greater than 14 kb. To ensure that orthologous 
positions at the aminoacid level were compared, alignments of 
goat and human protein sequences were obtained with GOBALT 
(http://www.ncbi.nlm.nih.gov/tools/cobalt/), using the protein 
id's in GU68049 and in the human revised Cambridge Reference 
Sequence (rGRS, Acc. n. NC_0 12920). For RNA-coding genes, a 
multiple alignment of GU068049, GU295658 and the rGRS 
sequences was generated with Glustal [37], with default settings. 
For each position variant in the goat, a surrounding range of 40 bp 



PLOS ONE I www.plosone.org 



9 



April 2014 | Volume 9 | Issue 4 | e95969 



Goat mtDNA Haplogroup A 



was re-aligned and visually inspected to exclude instances with 
multiple alignment solutions. 

The list of aJl possible aminoacid substitutions in the rCRS [22] 
was downloaded and analyzed in a spreadsheet. The use of this list 
was possible only for positions in which one of the goat variant 
aminoacids matched the homologue in rCRS. The list includes the 
MutPred pathogenicity score, which is determined by a set of 
features reflecting protein structure and its dynamics, the presence 
of functional residues, biases of amino acid sequence, and 
evolutionary conservation at the substitution site and in its 
neighborhood. This score ranges from 0 (null pathogenicity) to 1 
(maximum pathogenicity). Also included is the mtDNA selection 
score, defined as the ratio between the probability of observing a 
substitution of a given pathogenicity among all substitutions and 
the probability of observing it in three major clades of the human 
mtDNA tree. Values lower than 1 denote substitutions with likely 
weak or null detrimental effects. 

Statistical tests were performed with R. 

Supporting Information 

Figure SI Lmearized map of goat mtDNA (at bottom). 

The positions of the 33 amplicons used to sequence the entire 
molecule are shown in the lower panel as black segments. Note 
that Fragment 33 spans mtDNA position 1. A representation of 
the coverage of the entire mitogenome in 28 individuals is given in 
the top panel. Lines are colored according to D-loop clade 
affiliation (reported on the left, after individual id.). Grayed 
segments indicate portions not covered by sequencing or 
producing poor results. Individual id's are as in [12]. 
(JPG) 

Figure S2 Phylogenetic tree of 28 goat mtDNAs (posi- 
tions 1-15,430) plus two reference sequences, obtained 
with BEAST. Individual id's and D-loop clade affiliations are 

reported at the tips of the tree. The posterior for each node is 
shown next to it, in italics. The scale bar is in mutation units. The 
table at right shows the assignment of domestic goat sequences to 
each clade (based solely on D-loop sequences, according to criteria 
of ref [12]) and their geographic provenance. Weu = western 
Europe (ref [5]); Ceu = central Europe (ref [5]); Meu = 
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